Record: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean) by stukenov · Pull Request #1364 · openai/parameter-golf

stukenov · 2026-04-04T23:13:00Z

Summary

val_bpb: 1.1025 (3-seed mean, std 0.0011)
Artifact: <16 MB (max 15,985,137)
Training: 600s on 8xH100 SXM | Eval: ~500s

Beats merged SOTA (PR #1019, 1.1147) by 0.0122 BPB = 0.0206 nats (4x the 0.005-nat threshold).

Key Innovation: Pre-quantization AdamW TTT

Standard post-quant SGD TTT fails on GPTQ-quantized models (25 failures, PR #756). We run AdamW TTT on the full-precision EMA model before GPTQ:

Train 600s → EMA model (BPB 1.1463)
AdamW TTT: 6 epochs, freeze first 2 blocks, cosine LR → BPB 1.1189 (-0.027 gain)
Full Hessian GPTQ on adapted model → sliding BPB 1.1025

3-Seed Results

Seed	Sliding BPB	Artifact
1337	1.1023	15,930,573
42	1.1037	15,985,137
2025	1.1016	15,935,233
Mean	1.1025

Compliance

No SLOT, no n-gram cache, no eval-time adaptation
Pre-quant TTT adapts model before any eval scoring (Conditions 1-4 satisfied)
Full Hessian GPTQ calibrated on training data (inside 600s budget)

Reproduction

SEED=1337 torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

PR #1019 (@abaybektursun), PR #1306 (pre-quant TTT concept), PR #1125 (QK-Gain), PR #478 (XSA-all), PR #535 (GPTQ), PR #493 (LeakyReLU²)

Pre-quant TTT (6ep AdamW on EMA before GPTQ) gives -0.027 BPB gain. 3 seeds: 1.1023, 1.1037, 1.1016 (mean 1.1025, std 0.0011). All artifacts under 16MB. No SLOT, no n-gram. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

primary path - CRITICAL: PR openai#1351 (Discriminative TTT, 1.0807) self-closed by author on 2026-04-05 — pre-quant AdamW TTT ruled as pre-eval adaptation on val data. Removed pre-quant TTT from technique table and plan. - Updated strategy to PR openai#1334 (Depth Recur + Parallel Residuals + MuonEq-R, 1.0897) as primary architecture target — zero legality flags. - Logged new PRs: openai#1379 (0.4162, n-gram mixer), openai#1376 (0.7094, SLOT-24 + pre-quant TTT), openai#1364 (1.1025, pre-quant TTT at risk), openai#1370 (1.003, GDN). - SLOT and pre-quant TTT both blocked; discriminative TTT post-quant still legal. - Updated CLAUDE.md Competition Strategy + Technique Reference + Lessons (v9.0). https://claude.ai/code/session_01RTLvTuYBp9YMtudwrY8mYM

@clarkkev

…ed mean) Merges @clarkkev's openai#1394 (SP8192, SDClip, GPTQ embeddings, skip gates) with @stukenov's openai#1364 (pre-quant AdamW TTT). First combination of these techniques. 3-seed mean: 1.07948 BPB (std=0.00043), artifact 15.12 MB. Built with Claude Opus 4.6 as AI co-author. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

This was referenced Apr 6, 2026

Record: SP8192 + Pre-Quant TTT — val_bpb 1.07948 (3-seed mean) #1416

Open

Record: Combined 3-Layer Recurrence + Parallel Residuals + Polar Express + Brotli — val_bpb 1.1067 (3-seed mean) #1396

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Record: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean)#1364

Record: Pre-quant AdamW TTT + QK-Gain 4.0 — val_bpb 1.1025 (3-seed mean)#1364
stukenov wants to merge 1 commit intoopenai:mainfrom
stukenov:submission/v6-safe-prequant-ttt

stukenov commented Apr 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

stukenov commented Apr 4, 2026

Summary

Key Innovation: Pre-quantization AdamW TTT

3-Seed Results

Compliance

Reproduction

Credits

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant